Abstract

Reliance on social media has been increasing steadily from year to year. Because it is an anonymous
medium of communication, people tend to share offensive comments, which can be problematic and
potentially cause considerable harm to society. To address this issue, research into automated
methods that detect offensive text on social media platforms has become important. Research in this
field for the Arabic language is not as widely available as for other languages. Owing to recent
breakthroughs in Arabic Natural Language Processing, we were able to achieve more accurate results
in detecting offensive content on social media. Arabic poses a different challenge compared to
English, being a morphologically rich language. Drawing on the recent breakthrough in
transformer-based models such as BERT, which have achieved state-of-the-art results in various
tasks, building upon the AraBERT pretraining, which has been shown to outperform multilingual BERT,
and utilizing Arabic-specific pre-processing methods, we were able to achieve better results than
established approaches for this task. Specifically, the BERT-base model achieved an F1-score of
84.88% on a multi-platform, multi-dialect dataset.

Article Info

Volume 2, Issue 1, May 2022
Received : 13 February 2022
Accepted : 09 April 2022
Published : 05 May 2022
doi: 10.51483/IJDSBDA.2.1.2022.20-25
1. Introduction


Offensive speech on social media has many adverse effects on users, and automated detection of this
speech can help to regulate this issue. Research has been much slower in Arabic language NLP tasks
relative to English or Latin languages. With social media being able to reach more people in the
world quicker than ever, as highlighted by users in the Arab region making up 8.4% of Facebook users
(Salem, 2017), research into the issue for the Arabic language is necessary.

This paper is organized into three main sections. Section II discusses previous work on the topic,
and Section III is an explanation of the approach we took when addressing the issue. Finally,
Section IV shows the results and compares them with results obtained by other approaches, together
with an analysis of said results.


2. Established Solutions 
2.1. Dataset Construction for the Detection of Anti-Social Behavior in Online Communication in Arabic


This research paper studies all types of antisocial behavior, such as offensive language and
cyberbullying. The purpose of the research is to collect data, annotate it, add useful features, and
give a verdict on whether it would be suitable for use in machine learning models. The authors
collected Arabic text from YouTube comments that serve the purpose of offensive language detection.
They take into account the different Arabic dialects in their prediction, as well as misspellings
and text written in various languages. After analyzing and annotating the data and adding the useful
features discussed, they concluded that the data would be useful for machine learning models in
further research on the detection of offensive language.


2.2. Detecting Offensive Language on Arabic Social Media Using Deep Learning


In this paper the authors address the same topic, attempting four different approaches to the
problem, mainly revolving around deep learning. These were a Convolutional Neural Network (CNN), a
Bidirectional Long Short-Term Memory (Bi-LSTM), a Bi-LSTM with attention mechanism, and a combined
CNN-LSTM architecture. For all four approaches, they trained and tested on the same dataset of
Arabic YouTube comments collected by Alakrot et al. (2018) and followed the preprocessing techniques
recommended in the same paper. They took it one step further by applying word embedding.
Specifically, they used the AraVec word embedding (Soliman et al., 2017). Ultimately the CNN-LSTM
architecture proved to get the best results, with an accuracy of around 87%.


2.3. Multilingual and Multi-Aspect Hate Speech Analysis


In this paper, the authors widened the task by including hate speech prediction in three languages
(Arabic, English, and French) rather than only Arabic. They gathered a dataset of around 13,000
tweets and labeled them using Amazon Mechanical Turk, a crowdsourcing platform. They first tried
implementing baseline models such as Logistic Regression fed with Bag of Words features. They then
utilized a Bidirectional LSTM, and attempted training on single languages and on multiple languages
simultaneously in the same dataset. They concluded that in some tasks a multi-language model can
outperform single-language ones. Their best F-score was an 86% average between that of the three
languages when detecting the directness of a tweet (F1-score of 84%). When it comes to the Arabic
tweets in the dataset, they were able to achieve an F1-score of 56% when detecting the targeted
label, yet it must be noted that the labels were either 'Normal' or one of five other categories of
abuse (6 classes).


2.4. Analysis and Detection of Religious Hate Speech in the Arabic Twittersphere 


In this research paper the authors seek to develop an automated method for detecting offensive
content on Twitter specifically, mainly in Arabic. They focus on religious hate speech. After
building their own dataset, they moved towards classification using mainly deep learning models.
They first pretrained their word embeddings and then chose to apply them to a Recurrent Neural
Network with Gated Recurrent Units.


2.5. L-HSAB: A Levantine Twitter Dataset for Hate Speech and Abusive Language


In this paper, the authors focus on the detection of Arabic hate speech. The research is done on the
Levantine dialect specifically. They create a dataset which consists of a combination of datasets
released by former research on the topic, in order to get a bigger, more representative dataset. For
their implementation they then looked at the most common words within their dataset, assigning hate
scores to each of those words. They then used a Support Vector Machine as their supervised model to
solve this classification problem and obtained a final F1-score of 89.6%.


2.6. Arabic Offensive Language on Twitter: Analysis and Experiments


In this piece of research the authors use another approach, based on Support Vector Machine
techniques. They trained on a dataset of 10,000 tweets which they collected themselves, and they
annotated their dataset with four different classes: offensive, vulgar, hate speech, or clean. In
our implementation we used this dataset, however we operated on binary classes, offensive or clean
only. For pre-processing they used the Farasa Arabic NLP toolkit to apply a series of steps which
include tokenization and removing stop words. They experimented with classification models such as
SVM and Logistic Regression coupled with AdaBoost, however ultimately they decided to use a
fine-tuned version of the multilingual variant of BERT, which can take almost any language as input,
to which they added a dense layer and a softmax classifier.


2.7. A Multi-Platform Arabic News Comment Dataset for Offensive Language Detection


Here the authors look at and filter data from various social media platforms, those being YouTube,
Reddit, Wikipedia, and Twitter. They compiled their own dataset and used crowdsourcing in order to
annotate the comments collected. To prepare the data they first tokenized the text, removing all
punctuation, URLs, and stop words. They used a basic classification approach, namely SVM, to produce
their best results.


2.8. Quick and Simple Approach for Detecting Hate Speech in Arabic Tweets


In this piece of research the authors were seeking to create a quick and easy approach to detecting
hate speech. They tried out various models, including several classical and neural learning models.
They removed all punctuation and normalized the text by using AraVec word embedding (Soliman et al.,
2017). They tried out different models such as SVM, Gradient Boosting, and Logistic Regression.
However, they ultimately decided to follow a more complex model, namely a combination of a CNN and
an LSTM, for their final approach and achieved a fair result of around 73%.


3. Proposed Solution 


The solution proposed by this paper is motivated by solving the shortcomings of previous works
attempting the same task. These revolve around three main points, namely, the lack of sufficient
data, the lack of diverse data, and not deploying state-of-the-art models and methods in many of the
cases discussed in the Established Solutions section. To solve these issues we focused on collecting
and aggregating a larger, more diverse dataset, applying newly proposed pre-processing methods, and
training the BERT-base model by fine-tuning the AraBERT pretrained weights.


3.1. Aggregated Dataset


Using any single one of the previously constructed datasets in Ousidhoum et al. (2019); Albadi et
al. (2018); Alakrot et al. (2018); Mubarak et al. (2017); and Chowdhury et al. (2020) would not be
sufficient to train a complex model such as BERT-base on the task at hand. For this reason, we
propose aggregating the datasets in Ousidhoum et al. (2019); Albadi et al. (2018); Alakrot et al.
(2018) and Mubarak et al. (2017) into one larger, more diverse dataset that is more representative
of the Arabic social media landscape. These specific datasets were selected due to the similarity in
the methodology of data collection and labeling, and some changes were made to multi-label datasets
as the task in this paper is a binary one.
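
As a rough illustration of the aggregation step, the sketch below collapses multi-label annotations
into the binary scheme and merges several sources into one list. It is a minimal sketch, not the
implementation used in this paper: the label names in `CLEAN_LABELS`, the `source` field, and the
removal of exact duplicate texts are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical names for the "not offensive" class; each source dataset
# uses its own labeling scheme, so this set is an assumption.
CLEAN_LABELS = {"clean", "normal", "not_offensive"}


def to_binary(label):
    """Map a source label to 1 (offensive) or 0 (clean)."""
    return 0 if label.strip().lower() in CLEAN_LABELS else 1


def aggregate(datasets):
    """Merge (text, label) pairs from several sources into one binary dataset.

    Exact duplicate texts are kept only once (an assumed homogeneity check).
    """
    seen = set()
    merged = []
    for source_name, rows in datasets.items():
        for text, label in rows:
            key = text.strip()
            if not key or key in seen:
                continue
            seen.add(key)
            merged.append({"text": key, "label": to_binary(label), "source": source_name})
    return merged


# Toy stand-ins for the real datasets.
toy_sources = {
    "twitter": [("post a", "hate"), ("post b", "Normal")],
    "youtube": [("post c", "offensive"), ("post a", "hate")],
}
merged = aggregate(toy_sources)
label_counts = Counter(row["label"] for row in merged)
print(len(merged), dict(label_counts))
```

Any category other than the clean class is treated as offensive here, which mirrors the binary
framing of the task but would need to be checked against each dataset's annotation guidelines.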


The aggregated dataset, after ensuring the homogeneity of the data, amounted to a total of 24,242
instances for training. This dataset contains multi-platform and multi-dialectal social media posts.
A large part of the performance gained is due to the larger dataset, and it would have been a larger
gain if it were not for the increased complexity of the task after having added multiple dialects to
the dataset. Making this change while maintaining high accuracy metrics is only possible since
AraBERT is pre-trained on multi-dialect data.


3.2. Pre-Processing 


The aforementioned combined dataset was pre-processed to remove specific characters and artifacts
that could hinder the model performance. Various pre-processing techniques are utilized in English
language models, however, a one-to-one import of the same methods does not typically provide the
best results (Antoun et al., 2020). This is because of the Arabic language's system of
concatenation, meaning that different words could share the same meaning yet have different forms.
For example, the "Al" prefix, which is equivalent to "the" in English, is attached to the word, yet
it is not a part of its meaning. This inherent difference in the Arabic language often leads to
worse performance when attempting to apply the same techniques used for Latin languages.
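
The effect of an attached prefix such as "Al" can be illustrated with a toy splitter. The sketch
below is for illustration only and is not the Farasa segmenter: the one-entry prefix list, the
`min_stem_len` guard, and the "+" marker convention are assumptions chosen for the example.

```python
from typing import List

# Hypothetical, deliberately tiny prefix list: only the definite article "Al".
PREFIXES = ["ال"]


def split_prefix(word: str, min_stem_len: int = 2) -> List[str]:
    """Split a known prefix off a word, marking it with a trailing '+'."""
    for prefix in PREFIXES:
        stem = word[len(prefix):]
        if word.startswith(prefix) and len(stem) >= min_stem_len:
            return [prefix + "+", stem]
    return [word]


def segment(text: str) -> str:
    """Apply split_prefix to every whitespace-separated word in the text."""
    return " ".join(piece for word in text.split() for piece in split_prefix(word))


with_article = split_prefix("الكتاب")  # "the book"
without_article = split_prefix("كتاب")  # "book"
print(with_article, without_article)
```

After splitting, both forms expose the same stem, so a model sees one vocabulary item for the word
instead of two unrelated surface forms. A real segmenter also has to handle suffixes, stacked
affixes, and words that merely begin with the same letters, which this toy rule does not attempt.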

To address this issue we used the approach proposed by Abdelali et al. (2016), where words are first
segmented into stems, prefixes, and suffixes. In their paper, Abdelali et al. found that this method
was faster and more accurate than other implementations of Arabic segmentation. This example further
illustrates how the Farasa segmenter works, where this word is broken down into three:


3.3. Model 


To be able to solve the problem at hand, we fine-tuned the AraBERT pre-trained Arabic representation
model (Antoun et al., 2020), which is based on BERT-base (Devlin et al., 2018).


The BERT model, which has pushed the state of the art since its introduction, is at its core a stack
of bidirectional Transformer encoders. The base version has 12 encoder blocks, 12 attention heads,
and 768 hidden dimensions (Devlin et al., 2018). The maximum sequence length for BERT-base is 512
tokens, which enabled us to use the full dataset without resorting to sliding-window techniques or
dropping any instances, since all posts were well below this threshold. The number of parameters in
the model amounts to 110M (Devlin et al., 2018). Model pre-training is done by both Masked Language
Modeling, where a randomly selected word is masked and the objective is to predict it from its
context, and the self-describing Next Sentence Prediction. BERT's bidirectionality allows a better
understanding of context and was found to achieve much better results on similar tasks (Devlin et
al., 2018). A model overview is found in Figure 1.


Ahmed Fahmy / Int.J.DataSci. & Big Data Anal. 2(1) (2022) 20-25 Page 23 of 25


[Figure: BERT overview with two panels. Pre-training: unlabeled sentence A and B pair, masked
sentence A and masked sentence B. Fine-tuning: question-answer pair, question and paragraph,
start/end span.]


Figure 1: Aspect Ratio
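
The 110M parameter figure quoted for BERT-base can be sanity-checked from the architecture
dimensions. The sketch below is a back-of-the-envelope count under stated assumptions: the
feed-forward size of 3072 and the vocabulary size of 30,522 are taken from the original English
BERT-base release, not from this paper, and AraBERT uses its own vocabulary, so its exact count
differs.

```python
def bert_encoder_params(vocab_size=30522, hidden=768, layers=12,
                        intermediate=3072, max_positions=512, type_vocab=2):
    """Count the weights and biases of a BERT-style encoder with a pooler."""
    layer_norm = 2 * hidden  # scale and shift vectors
    # Token, position and segment embeddings, followed by one layer norm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, followed by one layer norm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> intermediate -> hidden) and one layer norm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


total = bert_encoder_params()
print(f"{total:,}")  # 109,482,240 with the assumed vocabulary size
```

Roughly 85M of these parameters sit in the 12 encoder blocks, and most of the remainder is the
token embedding table, which is why the total shifts with the vocabulary size.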


The need for a larger dataset is highlighted by the fact that the number of parameters in this model
exceeds that of other models in previous research. The model was implemented using the PyTorch
framework with the HuggingFace library for Natural Language Processing (NLP). The Adam optimization
algorithm was used with a warm-up linear schedule, and an initial learning rate of 2e°. The full
sequence length of 512 was utilized, with a batch size of 8.
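
The shape of a warm-up linear schedule can be sketched in a few lines of plain Python. This is an
illustration under stated assumptions rather than the training code used here: the learning-rate
exponent is not legible in the source, so the peak rate of 2e-5 is assumed, as are the number of
epochs and the warm-up fraction. Only the training set size (24,242) and the batch size (8) come
from the text.

```python
import math


def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate that rises linearly to base_lr, then decays linearly to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TRAIN_SIZE = 24_242      # from the text
BATCH_SIZE = 8           # from the text
EPOCHS = 3               # assumption
BASE_LR = 2e-5           # assumption: exponent not legible in the source
WARMUP_FRACTION = 0.1    # assumption

steps_per_epoch = math.ceil(TRAIN_SIZE / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = int(WARMUP_FRACTION * total_steps)

for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    print(step, linear_warmup_lr(step, BASE_LR, warmup_steps, total_steps))
```

In a PyTorch and HuggingFace setup, a schedule of this shape is commonly obtained with the
`get_linear_schedule_with_warmup` helper from the `transformers` library, attached to the optimizer.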


4. Experimental Results


Figure 2 shows the Receiver Operating Characteristic (ROC) curve obtained by training the model. It
must be noted that 20% of the dataset was used for testing.
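
A held-out split of this kind can be sketched as follows. This is a minimal sketch with assumed
details: the text states only the 20% test proportion, so the stratification by label, the fixed
seed, and the `text` and `label` field names are assumptions for the example.

```python
import random
from collections import defaultdict


def stratified_split(rows, test_fraction=0.2, seed=13):
    """Split rows into train and test sets, keeping the label ratio in both."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        group = list(by_label[label])
        rng.shuffle(group)
        cut = round(len(group) * test_fraction)
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test


# Toy data: 25 offensive (label 1) and 75 clean (label 0) posts.
rows = [{"text": f"post {i}", "label": int(i % 4 == 0)} for i in range(100)]
train, test = stratified_split(rows)
print(len(train), len(test), sum(row["label"] for row in test))
```

Stratifying keeps the share of offensive posts the same in both parts, which matters when the
classes are unbalanced.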


[Figure: ROC curve. Horizontal axis: False Positive Rate (0.0 to 1.0). Vertical axis: True Positive
Rate.]


Figure 2: Receiver Operating Characteristic Curve
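
For readers unfamiliar with how such a curve is built, the sketch below computes ROC points and the
area under the curve from predicted scores by sweeping a decision threshold. It is a generic
illustration on made-up toy scores, not the evaluation code or data behind Figure 2.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs for a threshold sweep."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy example: 1 = offensive, 0 = clean; scores are made-up model outputs.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(toy_labels, toy_scores)
auc = area_under_curve(points)
print(points[-1], round(auc, 4))
```

The area equals the fraction of (offensive, clean) pairs in which the offensive post receives the
higher score, so a value of 0.5 corresponds to random ranking and 1.0 to a perfect one.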


Below is a comparison of our results with those obtained by the previous research discussed in the
Established Solutions section. F1-score is the most representative metric and was reported by all
papers.


A more detailed analysis of the results in Table 1 must take into account the implementation details
mentioned in the Established Solutions section. Half the papers used for comparison in Table 1 did
not report accuracy, which is not an issue, as F1-score is a better metric for comparing across
different datasets that are unbalanced. It must be noted that Ousidhoum et al. (2019) was included
because the dataset produced in that paper was incorporated into our dataset, yet their predicted
labels broke down the offensive texts into one of five sub-categories, which explains the lower
F1-score. Notably, Mulki et al. (2019) produced the best reported F1-score, yet that is due to the
fact that it focused on a single dialect (Levantine), and this is further supported by its use of
less advanced techniques, where Naive Bayes produced the best results. The most logical comparison
is with Mohaouchane et al. (2019) and Mubarak et al. (2020), with the former training on the dataset
by Alakrot et al. (2018), and the latter using a similar approach to ours. In Mohaouchane et al.
(2019), the best results were obtained using a CNN model with AraVec word embedding, with a CNN-LSTM
model coming a close second at 83.65%. However, the dataset was still less diverse than the one in
this implementation, coming from a single social media platform (YouTube) and consisting of a low
number of Levantine and Egyptian dialect instances relative to ones from the Gulf (Alakrot et al.,
2018), which is bound to affect the results. When it comes to Mubarak et al. (2020), their use of
multilingual BERT, which underperforms relative to pretraining specific to Arabic, coupled with a
significantly smaller dataset (10,000), explains the relative increase in F1-score in our
implementation, with the implementation in Mubarak et al. (2020) achieving a 79.7% score. Despite
the hypothesized challenge of a more diverse dataset, the BERT-base model with AraBERT pretraining
was able to outperform other implementations. This can be attributed to BERT's ability to
"understand" contextualized information, making it less sensitive to specific words and more capable
of identifying them in their appropriate context.
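
The remark that F1-score is more informative than accuracy on unbalanced data can be seen in a small
worked example. The numbers below are made up for illustration and are unrelated to the results in
Table 1.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the offensive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def f1_score(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    if tp == 0:
        return 0.0  # no offensive post was correctly flagged
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy unbalanced set: 10 offensive posts followed by 90 clean posts.
y_true = [1] * 10 + [0] * 90
always_clean = [0] * 100                           # never flags anything
imperfect = [1] * 6 + [0] * 4 + [1] * 4 + [0] * 86  # finds 6 of 10, 4 false alarms

print(accuracy(y_true, always_clean), f1_score(y_true, always_clean))
print(accuracy(y_true, imperfect), f1_score(y_true, imperfect))
```

The classifier that never flags anything reaches 90% accuracy while being useless for detection,
whereas its F1-score of zero exposes the failure. The second classifier is only slightly more
accurate but has a far higher F1-score.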


Table 1: Reported Metrics Comparison (Best Reported Metrics)

Paper                            Accuracy (%)    F1-Score (%)
This Implementation              85              84.88
(Ousidhoum et al., 2019)*        -               56
(Albadi et al., 2018)            79              78
(Mulki et al., 2019)**           90.3            89.6
(Mubarak et al., 2017)           -               60
(Mohaouchane et al., 2019)       87.84           84.05
(Mubarak et al., 2020)           -               79.7
(Chowdhury et al., 2020)         74              68
(Abuzayed and Elsayed, 2020)     -               73


Some of the models included in the comparison are added as a baseline, since these introduced the
datasets used to construct the one used in this implementation, specifically those found in
Ousidhoum et al. (2019), Albadi et al. (2018), Mulki et al. (2019), and Mubarak et al. (2017).


5. Conclusion 


In this paper, we tackle the problem of automatically detecting offensive language on multi-dialect
Arabic social media. We built a larger, aggregated dataset from previously labeled datasets with
some modifications to maintain data cohesion, and made sure to check the labeling methodology so
that the dataset remains valid. We then evaluated the performance of the BERT-base model (Devlin et
al., 2018) using AraBERT's pretrained weights as a fine-tunable starting point (Antoun et al.,
2020). Pre-processing and segmentation were applied using the Farasa segmentation process by
Abdelali et al. (2016). We were able to achieve an F1-score of 84.88% on a more representative, yet
more difficult dataset. This is due to the BERT model's ability for recall and context
identification, which has been a breakthrough in Natural Language Processing (NLP) tasks. The
AraBERT pretraining enabled us to use that model with positive results. A larger focus was placed on
Mohaouchane et al. (2019) and Mubarak et al. (2020) in the analysis part of Section IV, as these
were the only implementations to attempt a deep learning approach with a comparable dataset for a
similar objective, which points to a lack of sufficient research on this important topic.


For future work, we would recommend incorporating other metadata from social media posts, such as
likes, replies, or otherwise, into the set of features, which would further introduce context to the
model and which we believe would increase the model accuracy.




References 


Abdelali, A., Darwish, K., Durrani, N. and Mubarak, H. (2016). Farasa: A Fast and Furious Segmenter
for Arabic. In Proceedings of the 2016 Conference of the North American Chapter of the Association
for Computational Linguistics: Demonstrations, 11-16. DOI: 10.18653/v1/N16-3003

Abuzayed, A. and Elsayed, T. (2020). Quick and Simple Approach for Detecting Hate Speech in Arabic
Tweets. In Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with
a Shared Task on Offensive Language Detection, 109-114.

Alakrot, A., Murray, L. and Nikolov, N. S. (2018). Dataset Construction for the Detection of
Anti-Social Behaviour in Online Communication in Arabic. Procedia Computer Science, 142, 174-181.

Albadi, N., Kurdi, M. and Mishra, S. (2018). Are they our Brothers? Analysis and Detection of
Religious Hate Speech in the Arabic Twittersphere. In 2018 IEEE/ACM International Conference on
Advances in Social Networks Analysis and Mining (ASONAM), 69-76.

Antoun, W., Baly, F. and Hajj, H. (2020). AraBERT: Transformer-Based Model for Arabic Language
Understanding. arXiv preprint arXiv:2003.00104.

Chowdhury, S. A., Mubarak, H., Abdelali, A., Jung, S. G., Jansen, B. J. and Salminen, J. (2020). A
Multi-platform Arabic News Comment Dataset for Offensive Language Detection. In Proceedings of The
12th Language Resources and Evaluation Conference, 6203-6212.

Devlin, J., Chang, M. W., Lee, K. and Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.

Mohaouchane, H., Mourhir, A. and Nikolov, N. S. (2019). Detecting Offensive Language on Arabic
Social Media using Deep Learning. In 2019 Sixth International Conference on Social Networks
Analysis, Management and Security (SNAMS), IEEE, 466-471.

Mubarak, H., Darwish, K. and Magdy, W. (2017). Abusive Language Detection on Arabic Social Media. In
Proceedings of the First Workshop on Abusive Language Online, 52-56.

Mubarak, H., Rashed, A., Darwish, K., Samih, Y. and Abdelali, A. (2020). Arabic Offensive Language
on Twitter: Analysis and Experiments. arXiv preprint arXiv:2004.02192.

Mulki, H., Haddad, H., Ali, C. B. and Alshabani, H. (2019). L-HSAB: A Levantine Twitter Dataset for
Hate Speech and Abusive Language. In Proceedings of the Third Workshop on Abusive Language Online,
111-118.

Ousidhoum, N., Lin, Z., Zhang, H., Song, Y. and Yeung, D. Y. (2019). Multilingual and Multi-aspect
Hate Speech Analysis. arXiv preprint arXiv:1908.11049.

Salem, F. (2017). The Arab Social Media Report 2017: Social Media and the Internet of Things:
Towards Data-Driven Policymaking in the Arab World. MBR School of Government, Dubai.

Soliman, A. B., Eissa, K. and El-Beltagy, S. R. (2017). AraVec: A Set of Arabic Word Embedding
Models for Use in Arabic NLP. Procedia Computer Science, 117, 256-265.


Cite this article as: Ahmed Fahmy (2022). Detecting Offensive Language in Multi-Dialectal Arabic
Social Media. International Journal of Data Science and Big Data Analytics, 2(1), 20-25. doi:
10.51483/IJDSBDA.2.1.2022.20-25.


